Linear Models with Matrices

library(tidyverse)
Previously: Reproducible Projects in R
Lesson plan
My goals are to demonstrate:
- The underlying mechanics of linear fixed-effects models using matrix operations in R
- The use of lm() for ANOVA and regression
First we’ll use a small “toy” dataset. Easy to see the matrices.
Then I’ll introduce a real dataset, give you a hint how to subset it, and turn you loose to analyze it yourself.
Steps we’ll do:
- Pose a question
- Write the model
- Estimate parameters with OLS
- Solve “by hand” in R
- Using the lm() function
- Partition variability: ANOVA
- Hypothesis testing: F-test, t-test - is the effect and/or the variance “significant”?
Simple example*
Our two favorite cultivars of clover:
- Cultivar A, the 4-leaf 🍀
- Cultivar B, the 3-leaf ☘️
Which one is really best?
We plant 8 plots. Four reps each cultivar.
What influences variation in clover biomass?
- Genetics? i.e. the cultivar identity (A vs. B)
- The average performance of clovers? An overall (or grand) mean.
- Error (random and otherwise)
*I’m borrowing from the excellent example by Dr. Felipe Ferrão here, also the example from Ch. 2 of Isik, Holland, and Maltecca (2017).
First, the model can be represented in several ways: as a per-plant equation, in statistical matrix notation, in words, and as a plot (Figure 1).
We make some simplifying assumptions, always.
We are going to do hypothesis testing. For now, to determine whether variance sources contribute “significantly” to the data (F-test) and whether the cultivar means differ (t-test).
This is an important part of some of our main downstream goals, like marker-trait association (GWAS).
Assumptions of linear models are:
- Independence of errors (no ‘autocorrelation’)
- Constant error variance (homoscedasticity) and, for the F- and t-tests below, normally distributed errors
- Lack of multi-collinearity (predictors not too correlated)
ANOVA with lm()
Two parts of Analysis of Variance
- Compute means
- Compute deviations (sums of squares, mean squares)
And we can write the classical ANOVA table. Variance is partitioned into two components: Among Groups and Within Groups.
| Source of Variation | Degrees of Freedom (DF) | Sum of Squares (SS) | Mean Square (MS) | F-value |
|---|---|---|---|---|
| Among Groups | \(\alpha-1\) | \(SS_A=\sum_{i=1}^{\alpha} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2\) | \(MS_A = \frac{SS_A}{\alpha-1}\) | \(F=\frac{MS_A}{MS_E}\) |
| Within Groups | \(N-\alpha\) | \(SS_E = \sum_{i=1}^\alpha \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2\) | \(MS_E = \frac{SS_E}{N-\alpha}\) | |
| Total | \(N-1\) | \(SS_T = \sum_{i=1}^\alpha \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\cdot\cdot})^2\) | | |
Where:
- \(\alpha\) = number of groups
- \(n_i\) = number of observations in group \(i\)
- \(N\) = total number of observations
- \(\bar{y}_{i\cdot}\) = mean of group \(i\)
- \(\bar{y}_{\cdot\cdot}\) = grand mean
Our goal is to partition the total variance into sources.
We compare the variance among groups (e.g. among genotypes) to the residual variance using a ratio of variances (the F-value).
Basically, the F-test looks for signal relative to noise. A high F-value means the among-group mean square is large relative to the residual mean square, so the group effect is probably important.
# Create an example
df <- data.frame(Replicate = 1:4,
GenoA = c(14, 17, 15, 15),
GenoB = c(11, 13, 10, 12))
df <- df %>%
# this will change the "wide" data frame to a "long" one
## it stacks the values of GenoA and GenoB into a column labelled "Biomass"
## and the col headers go to a new column "Genotype"
pivot_longer(cols = c(GenoA, GenoB),
names_to = "Genotype",
values_to = "Biomass") %>%
# Make genotype and replicate into factors
mutate(Genotype=as.factor(Genotype),
# Replicate needs to be a factor-type
# because it should be treated as a categorical NOT a numerical variable
### Replicate 4 doesn't come after Replicate 3 in any logical way.
Replicate=as.factor(Replicate))
fit.lm <- lm(Biomass ~ Genotype, data = df)
summary(fit.lm)
Call:
lm(formula = Biomass ~ Genotype, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.5000 -0.6875 -0.2500 0.7500 1.7500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.2500 0.6374 23.93 3.5e-07 ***
GenotypeGenoB -3.7500 0.9014 -4.16 0.00594 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.275 on 6 degrees of freedom
Multiple R-squared: 0.7426, Adjusted R-squared: 0.6997
F-statistic: 17.31 on 1 and 6 DF, p-value: 0.005943
fit.aov<-aov(Biomass ~ Genotype, data = df)
summary(fit.aov)
            Df Sum Sq Mean Sq F value  Pr(>F)
Genotype 1 28.12 28.125 17.31 0.00594 **
Residuals 6 9.75 1.625
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
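The aov() table above can be reproduced directly from the ANOVA formulas. Here is a minimal sketch, rebuilding the toy data inline so the chunk runs standalone:

```r
# Hand-compute the one-way ANOVA table pieces for the toy clover data
df <- data.frame(Genotype = rep(c("GenoA", "GenoB"), each = 4),
                 Biomass  = c(14, 17, 15, 15, 11, 13, 10, 12))

grand_mean  <- mean(df$Biomass)                         # grand mean
group_means <- tapply(df$Biomass, df$Genotype, mean)    # group means
n_i         <- tapply(df$Biomass, df$Genotype, length)  # group sizes

SS_A  <- sum(n_i * (group_means - grand_mean)^2)        # among groups
SS_E  <- sum((df$Biomass - group_means[df$Genotype])^2) # within groups
MS_A  <- SS_A / (length(n_i) - 1)                       # df = alpha - 1 = 1
MS_E  <- SS_E / (nrow(df) - length(n_i))                # df = N - alpha = 6
F_val <- MS_A / MS_E
round(c(SS_A = SS_A, SS_E = SS_E, MS_E = MS_E, F = F_val), 3)
```

These match the aov() output: SS among groups is 28.125, residual SS is 9.75, and F = 28.125 / 1.625 ≈ 17.31.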
t.test(Biomass~Genotype,data = df)
Welch Two Sample t-test
data: Biomass by Genotype
t = 4.1603, df = 5.9961, p-value = 0.005951
alternative hypothesis: true difference in means between group GenoA and group GenoB is not equal to 0
95 percent confidence interval:
1.544032 5.955968
sample estimates:
mean in group GenoA mean in group GenoB
15.25 11.50
Notice the way the coefficients are reported from lm().
It shows only an effect for (Intercept) and GenotypeGenoB.
Initially, we wrote the model as
\[ y_{ij} = \mu + \tau_i + \epsilon_{ij} \]
\(i\) genotypes (1,2), \(j\) replications (1,2,3,4)
So our matrices and vectors looked like this:
\[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{24} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{13} \\ \epsilon_{14} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{23} \\ \epsilon_{24} \end{bmatrix} \]
This specification implies 3 coefficients would be estimated (\(\mu\), \(\tau_1\) and \(\tau_2\)).
Parameter estimation with OLS
To see why lm() reports only two coefficients, let’s first calculate the OLS solution “by hand”.
Recall from class that linear models with fixed-effects and simple residuals can be solved with a closed form approach that minimizes the residual sum of squares (RSS). This is called the “Ordinary Least Squares” solution (OLS).
\[ \hat{\beta} = (X'X)^{-1}X'y \]
df
# A tibble: 8 × 3
Replicate Genotype Biomass
<fct> <fct> <dbl>
1 1 GenoA 14
2 1 GenoB 11
3 2 GenoA 17
4 2 GenoB 13
5 3 GenoA 15
6 3 GenoB 10
7 4 GenoA 15
8 4 GenoB 12
y<-df$Biomass
# Because R is too smart, we'll have to make the X matrix shown above for a demo
X<-matrix(c(1,1,1,1,1,1,1,1,
1,0,1,0,1,0,1,0,
0,1,0,1,0,1,0,1),
ncol=3)
X
     [,1] [,2] [,3]
[1,] 1 1 0
[2,] 1 0 1
[3,] 1 1 0
[4,] 1 0 1
[5,] 1 1 0
[6,] 1 0 1
[7,] 1 1 0
[8,] 1 0 1
R provides a function called model.matrix() for constructing design matrices. What model.matrix() does is what lm() and other statistical software do internally.
X1<-model.matrix(Biomass~Genotype,data = df)
X1
  (Intercept) GenotypeGenoB
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 1 1
7 1 0
8 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$Genotype
[1] "contr.treatment"
XtX<-t(X)%*%X
Xty<-t(X)%*%y
beta_hat<-solve(XtX)%*%Xty
Error in `solve.default()`:
! Lapack routine dgesv: system is exactly singular: U[3,3] = 0
Singular = a square matrix that is not invertible
A related term is “Rank” which refers to the number of linearly independent columns of a matrix.
# What is the "rank" of XtX? Want it to be 3.
qr(XtX)$rank
[1] 2
The XtX matrix is not “full rank” because the columns of X are linearly dependent: columns 2 and 3 sum exactly to column 1 (the intercept).
There aren’t enough degrees of freedom to express it this way.
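You can check the dependence directly. A quick sketch, rebuilding the over-parameterized X inline:

```r
# Rebuild the over-parameterized design matrix and show the linear dependence
X <- matrix(c(rep(1, 8),        # intercept column (mu)
              rep(c(1, 0), 4),  # tau_1 indicator
              rep(c(0, 1), 4)), # tau_2 indicator
            ncol = 3)
all(X[, 1] == X[, 2] + X[, 3])  # TRUE: column 1 = column 2 + column 3
qr(X)$rank                      # 2, not 3
```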
In R, functions like lm() will automatically constrain the model by contrasting levels against a reference (the first level of the factor).
So \(\tau_1\) is set to 0, and the reported GenotypeGenoB coefficient is \(\tau_2\), the deviation of GenoB from the reference level (GenoA).
\[ \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{24} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_2 \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{13} \\ \epsilon_{14} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{23} \\ \epsilon_{24} \end{bmatrix} \]
beta_hat<-solve(t(X1)%*%X1)%*%(t(X1)%*%y)
beta_hat
                [,1]
(Intercept)   15.25
GenotypeGenoB -3.75
fit.lm$coefficients
  (Intercept) GenotypeGenoB
        15.25         -3.75
Ta da! 🎉
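Treatment contrasts are not the only way to restore full rank. If you prefer coefficients that are the group means themselves, you can drop the intercept, giving a full-rank, two-column X (the “cell means” parameterization). A sketch, with the toy data rebuilt inline:

```r
# Cell-means parameterization: no intercept, one column per genotype level
df <- data.frame(Genotype = rep(c("GenoA", "GenoB"), each = 4),
                 Biomass  = c(14, 17, 15, 15, 11, 13, 10, 12))
fit.cm <- lm(Biomass ~ 0 + Genotype, data = df)
coef(fit.cm)  # GenotypeGenoA = 15.25, GenotypeGenoB = 11.50
```

The fit is identical; only the parameterization, and hence the meaning of each coefficient, changes.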
Check assumptions
hist(residuals(fit.lm))
plot(x = fitted(fit.lm), y = residuals(fit.lm))
plot(fit.lm)

Estimated marginal means
Our goal is often to compare the mean of different treatment groups, for example different genotypes.
You may have encountered “lsmeans” or “emmeans” in the literature, or if you use SAS.
Least squares means (LSMeans)—now more commonly called estimated marginal means (EMMs)—are model-based estimates of factor level means derived from a fitted linear model (LM). They are related to the “best linear unbiased estimators” (BLUEs) in a linear mixed model; we’ll get to all that later.
library(emmeans)
Warning: package 'emmeans' was built under R version 4.5.2
Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'
emmeans(fit.lm,"Genotype")
 Genotype emmean    SE df lower.CL upper.CL
GenoA 15.2 0.637 6 13.69 16.8
GenoB 11.5 0.637 6 9.94 13.1
Confidence level used: 0.95
EMMs are not arithmetic means. They are predicted means from the fitted model. If there were additional factors in our model, the Genotype EMMs would be evaluated by averaging over those other factors/covariates.
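To illustrate the averaging-over behavior, here is a sketch with Replicate added as a second factor (a hypothetical model for demonstration; because this toy design is balanced, the EMMs still equal the arithmetic group means):

```r
library(emmeans)

# Two-factor model: Genotype EMMs now average over the Replicate levels
df2 <- data.frame(Replicate = factor(rep(1:4, times = 2)),
                  Genotype  = rep(c("GenoA", "GenoB"), each = 4),
                  Biomass   = c(14, 17, 15, 15, 11, 13, 10, 12))
fit2 <- lm(Biomass ~ Genotype + Replicate, data = df2)
emm  <- emmeans(fit2, "Genotype")
emm
```

In an unbalanced dataset (like the real one below), the EMMs and the raw group means would diverge, which is exactly why we use model-based means.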
Look at how they are related to the coefficients from the OLS model.
summary(fit.lm)
Call:
lm(formula = Biomass ~ Genotype, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.5000 -0.6875 -0.2500 0.7500 1.7500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.2500 0.6374 23.93 3.5e-07 ***
GenotypeGenoB -3.7500 0.9014 -4.16 0.00594 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.275 on 6 degrees of freedom
Multiple R-squared: 0.7426, Adjusted R-squared: 0.6997
F-statistic: 17.31 on 1 and 6 DF, p-value: 0.005943
Hands-on
Some real data
Download the CCB Crimson Clover ALT dataset from Canvas.
Put it in the data/ sub-directory of your project.
library(tidyverse)

Import the dataset
alt<-read.csv(here::here("data","19-25CC_ALT_DATA_ALL_2025-08-23.csv"))
str(alt)
'data.frame': 3058 obs. of 16 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ site : chr "AL" "AL" "AL" "AL" ...
$ year : int 24 24 24 24 24 24 24 24 24 24 ...
$ site.year : chr "24AL" "24AL" "24AL" "24AL" ...
$ unique.id : chr "24ALCC_ALT-01-01" "24ALCC_ALT-01-02" "24ALCC_ALT-01-03" "24ALCC_ALT-02-01" ...
$ entry : chr "Aldo" "21NCCCLH" "23MDCC" "21NCCCE" ...
$ rep : int 1 1 1 1 1 1 1 1 1 1 ...
$ row : int 1 2 3 1 2 3 1 2 3 1 ...
$ range : int 1 1 1 2 2 2 3 3 3 4 ...
$ biomass.1 : num 56.3 46.4 55.8 79.8 60.4 ...
$ growth.stage.1 : num 9 7 7 9 8 8 9 9 8 7 ...
$ fallstandcount : int NA NA NA NA NA NA NA NA NA NA ...
$ springstandcount: int 95 95 95 95 95 75 75 95 75 95 ...
$ fall.vigor.av : num 5 5 5 5 5 9 5 5 9 5 ...
$ spring.vigor.av : num 9 5 5 9 5 5 5 5 5 5 ...
$ wintersurvival : num NA NA NA NA NA NA NA NA NA NA ...
alt %>% head
  X site year site.year        unique.id    entry rep row range biomass.1
1 1 AL 24 24AL 24ALCC_ALT-01-01 Aldo 1 1 1 56.33333
2 2 AL 24 24AL 24ALCC_ALT-01-02 21NCCCLH 1 2 1 46.41667
3 3 AL 24 24AL 24ALCC_ALT-01-03 23MDCC 1 3 1 55.75000
4 4 AL 24 24AL 24ALCC_ALT-02-01 21NCCCE 1 1 2 79.75000
5 5 AL 24 24AL 24ALCC_ALT-02-02 23NCCCE Elite 1 2 2 60.41667
6 6 AL 24 24AL 24ALCC_ALT-02-03 Dixie 1 3 2 57.16667
growth.stage.1 fallstandcount springstandcount fall.vigor.av spring.vigor.av
1 9 NA 95 5 9
2 7 NA 95 5 5
3 7 NA 95 5 5
4 9 NA 95 5 9
5 8 NA 95 5 5
6 8 NA 75 9 5
wintersurvival
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
About
The CCBN “Advanced Line Trial” (ALT) for Crimson Clover (Trifolium incarnatum) is a competitive variety trial. New “advanced line” bulks, created from top plants in CCBN nurseries, are compared against current best lines from other programs/years and commercial checks at up to 15 locations. Trials are conducted in randomized complete block designs with 15’ single-row plots flanked by triticale for competition; they assess biomass, vigor, winter survival, and seed yield.
- site: State Location Code
- year: year HARVESTED (trials are planted in Fall and harvested the following Spring)
- site.year: site-year combination
- unique.id: unique plot ID
- entry: Crimson Clover variety name
- rep: integer specifying a complete block within a given site.year
- row: corresponds to one full planter pass. I’d prefer these to be called columns to match matrix conventions (row-by-column), but that’s how CCB records it (see Figure 4).
- range: spans a plot in the direction the planter drives. In matrix convention, this would be called the row 🤦 (see Figure 4).
- biomass.1: plot dry mass in grams per linear foot harvested
- growth.stage.1: Khalu & Fick growth stage at time of harvest
- fallstandcount / springstandcount: percent of the plot (out of 100) with living plants, before/after winter
- fall.vigor.av / spring.vigor.av: visual rating of plot vigor on a time-specific 1-9 scale (1 = worst, 5 = average, 9 = best)
- wintersurvival: springstandcount / fallstandcount
Your task
- Find a balanced chunk of the data.
- Balance is going to mean a lack of missing data for most or all traits.
- I suggest taking out 2-4 site.years worth.
- Subset the dataset.
- Pick a trait and follow the steps we did above.
- Start simple with your modelling.
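One way to scout for a balanced chunk is to tabulate the proportion of missing values for every trait within each site.year. A sketch, shown on a tiny mock data frame standing in for alt (swap in the real alt you imported; the across() range assumes the trait columns run from biomass.1 through wintersurvival, as in str(alt) above):

```r
library(dplyr)

# Tiny mock standing in for `alt` -- replace with your imported data frame
alt <- data.frame(site.year      = rep(c("24AL", "25AL"), each = 3),
                  biomass.1      = c(56, NA, 55, 80, 60, 57),
                  wintersurvival = c(NA, NA, NA, 1, 0.9, 1))

# Proportion of NAs per trait within each site.year; low values = more balanced
miss_summary <- alt %>%
  group_by(site.year) %>%
  summarize(across(biomass.1:wintersurvival, ~ mean(is.na(.x))),
            Nobs = n())
miss_summary
```

Site-years where most traits show near-zero missingness are good candidates for your subset.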
Questions:
- What are the potential sources of variation in the data? Write the model.
- What sources of variation are significant contributors to the model fit?
- Which are the “best” performing Crimson Clovers?
Data summarization
To find a good data chunk, you might want to make some summaries.
Here’s a nice tidy way to compute summary statistics on a dataset like this, using group_by() and summarize().
# Example
alt %>%
filter(site=="AL") %>%
group_by(site.year) %>%
summarize(PropMissingBiomass=length(which(is.na(biomass.1)))/n(),
Nentries=length(unique(entry)),
Nobs=n())
# A tibble: 2 × 4
site.year PropMissingBiomass Nentries Nobs
<chr> <dbl> <int> <int>
1 24AL 0 28 56
2 25AL 0 30 60
# Does the code for computing "PropMissingBiomass" look confusing?
# To learn, break down each piece of the code and see what it does
# Like this
## alt$biomass.1 %>% is.na
## alt$biomass.1 %>% is.na %>% which
## length(which(is.na(alt$biomass.1)))/nrow(alt)

Having chosen the 2 AL site-years (for an example), I can split off a new data.frame for future use.
al_alt<-alt %>%
filter(site=="AL")

Plotting data
I like to use ggplot for plotting in R.
A great place to learn that is the Visualize chapter in R for Data Science.
To show it off, I’ll plot the density of the distribution of each trait in the dataset.
A bit of data transformation is needed, using pivot_longer to make a long-form dataset.
alt %>%
pivot_longer(cols = c(biomass.1:wintersurvival),
names_to = "Trait",
values_to = "Value") %>%
ggplot(.,aes(x=Value,fill=Trait)) + geom_density() +
facet_wrap(~Trait, scales='free')
Warning: Removed 4531 rows containing non-finite outside the scale range (`stat_density()`).
Next
That’s it for today.
In the next session(s) we will introduce linear mixed models and related concepts. We’ll return with Hands-on Lesson 3 on Mixed Models and learn 3 packages (lme4::lmer(), sommer::mmes(), asreml()).